Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.
The data come from a breast fine needle aspiration (FNA) test, a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) with a fine needle similar to the one used for a blood sample. Given these measurements, the aim is to build a model that can classify a breast cancer tumor into one of two classes.
Since the labels in the data are discrete, the prediction falls into one of two categories (i.e., malignant or benign). In machine learning, this is a classification problem.
Thus, the goal is to classify whether a tumor is benign or malignant, and to predict recurrence or non-recurrence of malignant cases after a certain period. To achieve this, we use machine learning classification methods to fit a function that can predict the discrete class of new input.
The breast cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.
Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of these, the mean, standard error, and "worst" (largest) value are recorded, giving 30 feature columns in total.
# importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
%matplotlib inline
sns.set_style('darkgrid')
# load dataset
df = pd.read_csv('data.csv')
df.head()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
df.shape
(569, 33)
df.tail()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | NaN |
| 565 | 926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | NaN |
| 566 | 926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | NaN |
| 567 | 927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | NaN |
| 568 | 92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | NaN |
5 rows × 33 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
df.isna()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 565 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 566 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 567 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 568 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
569 rows × 33 columns
df.isna().any()
id                         False
diagnosis                  False
radius_mean                False
texture_mean               False
perimeter_mean             False
area_mean                  False
smoothness_mean            False
compactness_mean           False
concavity_mean             False
concave points_mean        False
symmetry_mean              False
fractal_dimension_mean     False
radius_se                  False
texture_se                 False
perimeter_se               False
area_se                    False
smoothness_se              False
compactness_se             False
concavity_se               False
concave points_se          False
symmetry_se                False
fractal_dimension_se       False
radius_worst               False
texture_worst              False
perimeter_worst            False
area_worst                 False
smoothness_worst           False
compactness_worst          False
concavity_worst            False
concave points_worst       False
symmetry_worst             False
fractal_dimension_worst    False
Unnamed: 32                 True
dtype: bool
df.isna().sum()
id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: 32                569
dtype: int64
df = df.dropna(axis='columns')
df.describe(include="O")
| | diagnosis |
|---|---|
| count | 569 |
| unique | 2 |
| top | B |
| freq | 357 |
df.diagnosis.value_counts()
B    357
M    212
Name: diagnosis, dtype: int64
Using the value_counts method, we can see the number of samples in each unique value of a categorical feature.
df.head(2)
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 rows × 32 columns
diagnosis_unique = df.diagnosis.unique()
diagnosis_unique
array(['M', 'B'], dtype=object)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.hist(df['diagnosis'])
plt.title("Counts of Diagnosis")
plt.xlabel("Diagnosis")
plt.subplot(1, 2, 2)
sns.countplot(x='diagnosis', data=df)
plt.title("Counts of Diagnosis")
plt.show()
px.histogram(df, x='diagnosis')
cols = ["diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean"]
sns.pairplot(df[cols], hue="diagnosis")
plt.show()
size = len(df['texture_mean'])
area = np.pi * (15 * np.random.rand( size ))**2
colors = np.random.rand( size )
plt.xlabel("texture mean")
plt.ylabel("radius mean")
plt.scatter(df['texture_mean'], df['radius_mean'], s=area, c=colors, alpha=0.5);
df.head(2)
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 rows × 32 columns
labelencoder_Y = LabelEncoder()
df.diagnosis = labelencoder_Y.fit_transform(df.diagnosis)
df.head(2)
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 842517 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 rows × 32 columns
print(df.diagnosis.value_counts())
print("\n", df.diagnosis.value_counts().sum())
0    357
1    212
Name: diagnosis, dtype: int64

 569
Finally, we can see in this output that the categorical values have been converted into 0 and 1.
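LabelEncoder assigns integers to the sorted class labels, so 'B' becomes 0 and 'M' becomes 1. An explicit mapping produces the same encoding while making the convention self-documenting; a minimal sketch on a hypothetical mini-frame standing in for df:

```python
import pandas as pd

# Hypothetical stand-in for the diagnosis column of df.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Explicit mapping: benign -> 0, malignant -> 1.
# These are the same values LabelEncoder yields, since 'B' sorts before 'M'.
toy["diagnosis"] = toy["diagnosis"].map({"B": 0, "M": 1})
print(toy["diagnosis"].tolist())  # [1, 0, 0, 1, 0]
```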
cols = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']
print(len(cols))
df[cols].corr()
11
| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| diagnosis | 1.000000 | 0.730029 | 0.415185 | 0.742636 | 0.708984 | 0.358560 | 0.596534 | 0.696360 | 0.776614 | 0.330499 | -0.012838 |
| radius_mean | 0.730029 | 1.000000 | 0.323782 | 0.997855 | 0.987357 | 0.170581 | 0.506124 | 0.676764 | 0.822529 | 0.147741 | -0.311631 |
| texture_mean | 0.415185 | 0.323782 | 1.000000 | 0.329533 | 0.321086 | -0.023389 | 0.236702 | 0.302418 | 0.293464 | 0.071401 | -0.076437 |
| perimeter_mean | 0.742636 | 0.997855 | 0.329533 | 1.000000 | 0.986507 | 0.207278 | 0.556936 | 0.716136 | 0.850977 | 0.183027 | -0.261477 |
| area_mean | 0.708984 | 0.987357 | 0.321086 | 0.986507 | 1.000000 | 0.177028 | 0.498502 | 0.685983 | 0.823269 | 0.151293 | -0.283110 |
| smoothness_mean | 0.358560 | 0.170581 | -0.023389 | 0.207278 | 0.177028 | 1.000000 | 0.659123 | 0.521984 | 0.553695 | 0.557775 | 0.584792 |
| compactness_mean | 0.596534 | 0.506124 | 0.236702 | 0.556936 | 0.498502 | 0.659123 | 1.000000 | 0.883121 | 0.831135 | 0.602641 | 0.565369 |
| concavity_mean | 0.696360 | 0.676764 | 0.302418 | 0.716136 | 0.685983 | 0.521984 | 0.883121 | 1.000000 | 0.921391 | 0.500667 | 0.336783 |
| concave points_mean | 0.776614 | 0.822529 | 0.293464 | 0.850977 | 0.823269 | 0.553695 | 0.831135 | 0.921391 | 1.000000 | 0.462497 | 0.166917 |
| symmetry_mean | 0.330499 | 0.147741 | 0.071401 | 0.183027 | 0.151293 | 0.557775 | 0.602641 | 0.500667 | 0.462497 | 1.000000 | 0.479921 |
| fractal_dimension_mean | -0.012838 | -0.311631 | -0.076437 | -0.261477 | -0.283110 | 0.584792 | 0.565369 | 0.336783 | 0.166917 | 0.479921 | 1.000000 |
plt.figure(figsize=(12, 9))
plt.title("Correlation Graph")
cmap = sns.diverging_palette( 1000, 120, as_cmap=True)
sns.heatmap(df[cols].corr(), annot=True, fmt='.1%', linewidths=.05, cmap=cmap);
Using the Plotly package, we can show it as an interactive graph like this:
fig = px.imshow(df[cols].corr())
fig.show()
Feature Selection
Select features for prediction.
df.columns
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
prediction_feature = [ "radius_mean", 'perimeter_mean', 'area_mean', 'symmetry_mean', 'compactness_mean', 'concave points_mean']
targeted_feature = 'diagnosis'
len(prediction_feature)
6
X = df[prediction_feature]
X
| | radius_mean | perimeter_mean | area_mean | symmetry_mean | compactness_mean | concave points_mean |
|---|---|---|---|---|---|---|
| 0 | 17.99 | 122.80 | 1001.0 | 0.2419 | 0.27760 | 0.14710 |
| 1 | 20.57 | 132.90 | 1326.0 | 0.1812 | 0.07864 | 0.07017 |
| 2 | 19.69 | 130.00 | 1203.0 | 0.2069 | 0.15990 | 0.12790 |
| 3 | 11.42 | 77.58 | 386.1 | 0.2597 | 0.28390 | 0.10520 |
| 4 | 20.29 | 135.10 | 1297.0 | 0.1809 | 0.13280 | 0.10430 |
| ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 142.00 | 1479.0 | 0.1726 | 0.11590 | 0.13890 |
| 565 | 20.13 | 131.20 | 1261.0 | 0.1752 | 0.10340 | 0.09791 |
| 566 | 16.60 | 108.30 | 858.1 | 0.1590 | 0.10230 | 0.05302 |
| 567 | 20.60 | 140.10 | 1265.0 | 0.2397 | 0.27700 | 0.15200 |
| 568 | 7.76 | 47.92 | 181.0 | 0.1587 | 0.04362 | 0.00000 |
569 rows × 6 columns
y = df.diagnosis
y
0 1
1 1
2 1
3 1
4 1
..
564 1
565 1
566 1
567 1
568 0
Name: diagnosis, Length: 569, dtype: int32
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)
print(X_train)
# print(X_test)
radius_mean perimeter_mean area_mean symmetry_mean compactness_mean \
274 17.93 115.20 998.9 0.1538 0.07027
189 12.30 78.83 463.7 0.1667 0.07253
158 12.06 76.84 448.6 0.1590 0.05241
257 15.32 103.20 713.3 0.2398 0.22840
486 14.64 94.21 666.0 0.1409 0.06698
.. ... ... ... ... ...
85 18.46 121.10 1075.0 0.2132 0.10530
199 14.45 94.49 642.7 0.1950 0.12060
156 17.68 117.40 963.7 0.1971 0.16650
384 13.28 85.79 541.8 0.1617 0.08575
456 11.63 74.87 415.1 0.1799 0.08574
concave points_mean
274 0.04744
189 0.01654
158 0.01963
257 0.12420
486 0.02791
.. ...
85 0.08795
199 0.05980
156 0.10540
384 0.02864
456 0.02017
[381 rows x 6 columns]
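Since the classes are imbalanced (357 benign vs. 212 malignant), passing `stratify=y` to `train_test_split` would keep the class ratio the same in both splits. A sketch with synthetic stand-in labels matching the notebook's 357/212 balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the notebook's 357/212 class balance.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(569).reshape(-1, 1)  # dummy single feature

# stratify preserves the benign/malignant ratio in both train and test sets.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=15, stratify=y_demo
)
print(round(ytr.mean(), 2), round(yte.mean(), 2))  # nearly identical class ratios
```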
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as z = (x - u) / s, where u is the mean of the training samples and s is their standard deviation.
# Standardize the features to zero mean and unit variance
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the training set's mean and std; do not refit on test data
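A quick check on synthetic data (assumed shapes, not the notebook's df) confirms that the scaled training data has zero mean and unit variance, and that `transform` applies exactly z = (x - u) / s with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # synthetic "training" features
X_te = rng.normal(loc=10.0, scale=3.0, size=(20, 2))   # synthetic "test" features

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # fit on training data only
X_te_s = sc.transform(X_te)      # apply the training mean/std to the test set

# Scaled training data: mean ~0, std ~1 per column.
print(np.allclose(X_tr_s.mean(axis=0), 0, atol=1e-9))  # True
print(np.allclose(X_tr_s.std(axis=0), 1, atol=1e-9))   # True

# transform() is exactly (x - train_mean) / train_std.
manual = (X_te - sc.mean_) / np.sqrt(sc.var_)
print(np.allclose(X_te_s, manual))                     # True
```

Fitting the scaler again on the test set, as in an earlier version of this cell, leaks test statistics into preprocessing and makes train and test features incomparable.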
def model_building(model, X_train, X_test, y_train, y_test):
    """
    Fit the model and predict on the test set.
    Returns (train_score, test_accuracy, predictions).
    """
    model.fit(X_train, y_train)
    score = model.score(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)  # (y_true, y_pred)
    return (score, accuracy, predictions)
Let's make a dictionary of multiple models for bulk predictions.
models_list = {
"LogisticRegression" : LogisticRegression(),
"RandomForestClassifier" : RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=5),
"DecisionTreeClassifier" : DecisionTreeClassifier(criterion='entropy', random_state=0),
"SVC" : SVC(),
}
print(list(models_list.keys()))
print(list(models_list.values()))
['LogisticRegression', 'RandomForestClassifier', 'DecisionTreeClassifier', 'SVC'] [LogisticRegression(), RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=5), DecisionTreeClassifier(criterion='entropy', random_state=0), SVC()]
Now, train the models one by one and show the classification report for each.
Let's define a function for confusion matrix graphs.
def cm_metrix_graph(cm):
    # Draw a single confusion matrix heatmap; the caller decides when to call
    # plt.show(), so the function can be used inside a subplot grid.
    sns.heatmap(cm, annot=True, fmt="d")
df_prediction = []
confusion_matrixs = []
df_prediction_cols = [ 'model_name', 'score', 'accuracy_score' , "accuracy_percentage"]
for name, model in models_list.items():
    (score, accuracy, predictions) = model_building(model, X_train, X_test, y_train, y_test)
    print(f"\n\nClassification Report of '{name}'\n")
    print(classification_report(y_test, predictions))
    df_prediction.append([name, score, accuracy, "{0:.2%}".format(accuracy)])
    # store the confusion matrix for later plotting
    confusion_matrixs.append(confusion_matrix(y_test, predictions))

df_pred = pd.DataFrame(df_prediction, columns=df_prediction_cols)
Classification Report of 'LogisticRegression'
precision recall f1-score support
0 0.90 0.96 0.93 115
1 0.92 0.84 0.88 73
accuracy 0.91 188
macro avg 0.91 0.90 0.90 188
weighted avg 0.91 0.91 0.91 188
Classification Report of 'RandomForestClassifier'
precision recall f1-score support
0 0.92 0.96 0.94 115
1 0.93 0.88 0.90 73
accuracy 0.93 188
macro avg 0.93 0.92 0.92 188
weighted avg 0.93 0.93 0.93 188
Classification Report of 'DecisionTreeClassifier'
precision recall f1-score support
0 0.90 0.96 0.93 115
1 0.92 0.84 0.88 73
accuracy 0.91 188
macro avg 0.91 0.90 0.90 188
weighted avg 0.91 0.91 0.91 188
Classification Report of 'SVC'
precision recall f1-score support
0 0.90 0.97 0.93 115
1 0.94 0.84 0.88 73
accuracy 0.91 188
macro avg 0.92 0.90 0.91 188
weighted avg 0.92 0.91 0.91 188
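The per-class precision, recall, and F1 scores above come directly from the confusion matrix. A small sketch with a hypothetical 2x2 matrix (illustrative counts, not the notebook's actual results), using sklearn's convention of rows = true class, columns = predicted class:

```python
import numpy as np

# Hypothetical confusion matrix in sklearn's layout: [[TN, FP], [FN, TP]].
cm = np.array([[110, 5],
               [12, 61]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.92 0.84 0.88
```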
print(len(confusion_matrixs))
4
# Assuming confusion_matrixs is a list of confusion matrices
plt.figure(figsize=(15, 5))
for index, cm in enumerate(confusion_matrixs):
    plt.subplot(1, len(confusion_matrixs), index + 1)  # one subplot per model
    cm_metrix_graph(cm)  # draw the confusion matrix heatmap into the current subplot
    plt.title(f'Confusion Matrix {index + 1}')
plt.tight_layout()
plt.show()
df_pred
| | model_name | score | accuracy_score | accuracy_percentage |
|---|---|---|---|---|
| 0 | LogisticRegression | 0.916010 | 0.909574 | 90.96% |
| 1 | RandomForestClassifier | 0.992126 | 0.925532 | 92.55% |
| 2 | DecisionTreeClassifier | 1.000000 | 0.909574 | 90.96% |
| 3 | SVC | 0.923885 | 0.914894 | 91.49% |
df_pred.sort_values('score', ascending=False)
df_pred.sort_values('accuracy_score', ascending=False)
| | model_name | score | accuracy_score | accuracy_percentage |
|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.992126 | 0.925532 | 92.55% |
| 3 | SVC | 0.923885 | 0.914894 | 91.49% |
| 0 | LogisticRegression | 0.916010 | 0.909574 | 90.96% |
| 2 | DecisionTreeClassifier | 1.000000 | 0.909574 | 90.96% |
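The gap between training score and test accuracy in the table above (most visibly the decision tree's perfect 1.000000 training score against 0.909574 test accuracy) suggests overfitting, so a single split can be a misleading estimate. Cross-validation, whose helpers were already imported, averages performance over several splits. A sketch using sklearn's bundled copy of the same Wisconsin breast cancer data so the snippet is self-contained:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Putting the scaler inside a pipeline refits it on each fold's training part,
# so the held-out fold never leaks into the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_cv, y_cv, cv=5)
print(scores.round(3), scores.mean().round(3))  # per-fold and mean accuracy
```

The same pattern applies to any of the models in models_list, and GridSearchCV (also imported above) extends it to hyperparameter tuning.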